1.1. Regression Models for Predicting Number of Views and Likes per Video¶

This notebook explores the target variables in the datasets under data/processed/ and trains regression models to predict them.

In [1]:
import os
import torch
import joblib
import warnings
import numpy as np
import pandas as pd
import torch.nn as nn
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from torch.utils.data import TensorDataset, DataLoader
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from youtube_trends.config import PROCESSED_DATA_DIR, MODELS_DIR
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pio.renderers.default = 'notebook_connected'

warnings.filterwarnings('ignore')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
2025-05-25 04:59:47.062 | INFO     | youtube_trends.config:<module>:11 - PROJ_ROOT path is: C:\Users\eddel\OneDrive\Documents\MCD\AAA\youtube_trends\venv\src\youtube-trends

Preparing training, validation, and testing datasets¶

Loading processed datasets df_train, df_val and df_test.

In [2]:
df_train = pd.read_csv(PROCESSED_DATA_DIR / 'train_dataset.csv', low_memory=False)
df_val = pd.read_csv(PROCESSED_DATA_DIR / 'val_dataset.csv', low_memory=False)
df_test = pd.read_csv(PROCESSED_DATA_DIR / 'test_dataset.csv', low_memory=False)
In [3]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)
(46337, 199)
(7626, 199)
(6736, 199)

Feature selection.

In [4]:
drop_cols = ['video_view_count', 'video_like_count', 'days_to_trend', 'video_published_at']

X_train = df_train.drop(columns=drop_cols)
X_val = df_val.drop(columns=drop_cols)
X_test = df_test.drop(columns=drop_cols)

Only numeric columns are kept for the regression models. If --translate=True was specified during dataset processing, the video_title_translated column is available for interpretation, but it is not used by the models in this notebook.

In [5]:
X_train = X_train.select_dtypes(include=np.number)
X_val = X_val.select_dtypes(include=np.number)
X_test = X_test.select_dtypes(include=np.number)
In [6]:
X_train.describe()
Out[6]:
[Summary statistics (count, mean, std, min, 25%, 50%, 75%, max) of the 195 numeric features, including video_duration, video_comment_count, channel_view_count, channel_subscriber_count, published_dayofweek, published_hour, video_title_length, video_tag_count, the sentiment columns, and the lang_pca_*/video_category_pca_* components; the PCA components are centered near zero after processing.]

8 rows × 195 columns

Standardization of the numeric features.

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

joblib.dump(scaler, MODELS_DIR / 'scaler_regression.pkl')
Out[7]:
['C:\\Users\\eddel\\OneDrive\\Documents\\MCD\\AAA\\youtube_trends\\venv\\src\\youtube-trends\\models\\scaler_regression.pkl']
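For inference outside this notebook, the persisted scaler can be reloaded with joblib.load. A minimal self-contained sketch, using toy data and a local file name rather than the MODELS_DIR path above:

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy data, persist it, then reload it for inference.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaler = StandardScaler().fit(X)
joblib.dump(scaler, 'scaler_demo.pkl')

# At inference time, the reloaded scaler applies the *training* mean and std.
reloaded = joblib.load('scaler_demo.pkl')
X_new = reloaded.transform(np.array([[2.0, 20.0]]))
print(X_new)  # the training mean row maps to zeros
```

Reusing the fitted scaler (rather than refitting on new data) is what keeps validation, test, and production inputs on the same scale as the training set.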

Target features¶

In [8]:
y_train_dict = {'video_view_count': df_train['video_view_count'], 'video_like_count': df_train['video_like_count']}
y_val_dict = {'video_view_count': df_val['video_view_count'], 'video_like_count': df_val['video_like_count']}
y_test_dict = {'video_view_count': df_test['video_view_count'], 'video_like_count': df_test['video_like_count']}

Note: Since the number of likes per video is strongly correlated with the number of views, we assume an explicit dependency between these features; specifically, that the number of likes depends on the number of views, with likes less than or equal to views. This rests on the premise that YouTube counts any play of a video as a view, regardless of how long it was played. That premise is not entirely accurate: YouTube counts views based on a combination of factors, including how many times a video is watched, the duration of each view, and user interactions. While the exact algorithm is not disclosed, a view generally requires the viewer to watch a portion of the video, often at least 30 seconds, and to actively engage with the content.
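One lightweight way to enforce the likes ≤ views assumption at prediction time is to clip the model's raw output. cap_likes below is a hypothetical helper sketched for illustration, not part of this notebook's pipeline:

```python
import numpy as np

# Hypothetical post-processing: cap predicted likes at the view count,
# reflecting the likes <= views assumption described above.
def cap_likes(pred_likes, views):
    pred = np.maximum(pred_likes, 0)  # like counts cannot be negative
    return np.minimum(pred, views)    # like counts cannot exceed view counts

views = np.array([1000.0, 500.0, 200.0])
raw_pred = np.array([1200.0, -30.0, 150.0])
print(cap_likes(raw_pred, views))  # e.g. [1000., 0., 150.]
```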

In [9]:
X_train_like = np.concatenate([X_train_scaled, df_train[['video_view_count']].values], axis=1)
X_val_like = np.concatenate([X_val_scaled, df_val[['video_view_count']].values], axis=1)
X_test_like = np.concatenate([X_test_scaled, df_test[['video_view_count']].values], axis=1)

Regression models¶

To predict the number of views and likes, we use the following regression models.

In [10]:
model_classes = {
    'Linear_Regression': LinearRegression,
    'Ridge': Ridge,
    'Lasso': Lasso,
    'Decision_Tree': DecisionTreeRegressor,
    'Random_Forest': RandomForestRegressor,
    'XGBoost': XGBRegressor
}
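The training loop below instantiates each class with default hyperparameters. A variant (an assumption, not used in this notebook) pairs each class with keyword arguments, for example to fix random_state so the tree-based models are reproducible across runs:

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Hypothetical variant: store (class, kwargs) pairs so every run is reproducible.
model_specs = {
    'Ridge': (Ridge, {'alpha': 1.0}),
    'Random_Forest': (RandomForestRegressor, {'n_estimators': 100, 'random_state': 42}),
}

# Instantiate each model with its own keyword arguments.
models = {name: cls(**kwargs) for name, (cls, kwargs) in model_specs.items()}
print(sorted(models))  # ['Random_Forest', 'Ridge']
```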

Training and visualization of predicted versus actual values.

In [11]:
os.makedirs('iframe_figures', exist_ok=True)  # ensure the output folder exists before saving figures

results = {}
results_test = {}
n_target = 0

for name, ModelClass in model_classes.items():
    results[name] = {}
    results_test[name] = {}

    # video_view_count
    model_vv = ModelClass()
    model_vv.fit(X_train_scaled, y_train_dict['video_view_count'])
    y_pred_vv = model_vv.predict(X_val_scaled)
    mse_vv = mean_squared_error(y_val_dict['video_view_count'], y_pred_vv)
    r2_vv = r2_score(y_val_dict['video_view_count'], y_pred_vv)
    results[name]['video_view_count'] = {'MSE': mse_vv, 'R^2': r2_vv}
    joblib.dump(model_vv, MODELS_DIR / f'{name}_views.pkl')

    # Evaluate on test set
    y_test_pred_vv = model_vv.predict(X_test_scaled)
    mse_test_vv = mean_squared_error(y_test_dict['video_view_count'], y_test_pred_vv)
    r2_test_vv = r2_score(y_test_dict['video_view_count'], y_test_pred_vv)
    results_test[name]['video_view_count'] = {'MSE': mse_test_vv, 'R^2': r2_test_vv}

    # video_like_count
    model_vl = ModelClass()
    model_vl.fit(X_train_like, y_train_dict['video_like_count'])
    y_pred_vl = model_vl.predict(X_val_like)
    mse_vl = mean_squared_error(y_val_dict['video_like_count'], y_pred_vl)
    r2_vl = r2_score(y_val_dict['video_like_count'], y_pred_vl)
    results[name]['video_like_count'] = {'MSE': mse_vl, 'R^2': r2_vl}
    joblib.dump(model_vl, MODELS_DIR / f'{name}_likes.pkl')

    # Evaluate on test set
    y_test_pred_vl = model_vl.predict(X_test_like)
    mse_test_vl = mean_squared_error(y_test_dict['video_like_count'], y_test_pred_vl)
    r2_test_vl = r2_score(y_test_dict['video_like_count'], y_test_pred_vl)
    results_test[name]['video_like_count'] = {'MSE': mse_test_vl, 'R^2': r2_test_vl}

    # --- Visualizations ---
    for target, y_pred, y_true in zip(['video_view_count', 'video_like_count'], [y_pred_vv, y_pred_vl], [y_val_dict['video_view_count'], y_val_dict['video_like_count']]):
        if n_target % 2 == 0:
            print(f'\nModel: {name}\n')
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=y_true, y=y_pred, mode='markers', name='Predictions'))
        fig.add_trace(go.Scatter(x=y_true, y=y_true, mode='lines', name='Ideal'))
        fig.update_layout(
            title=f'{name} for {target} (Validation)',
            xaxis_title='True Values',
            yaxis_title='Predicted Values',
            template='plotly_white'
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1
Model: Linear_Regression

Model: Ridge

Model: Lasso

Model: Decision_Tree

Model: Random_Forest

Model: XGBoost

Results visualization (MSE and R^2 metrics)¶

In [12]:
for name, targets in results.items():
    print(f"\nValidation Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")
Validation Results for model Linear_Regression:

video_view_count — MSE: 222940215398476.88, R^2: 0.5820
video_like_count — MSE: 94001055879.53, R^2: 0.9225

Validation Results for model Ridge:

video_view_count — MSE: 222909980452372.00, R^2: 0.5820
video_like_count — MSE: 94031282749.22, R^2: 0.9225

Validation Results for model Lasso:

video_view_count — MSE: 222886591771235.97, R^2: 0.5821
video_like_count — MSE: 94092235818.60, R^2: 0.9224

Validation Results for model Decision_Tree:

video_view_count — MSE: 530837949859836.94, R^2: 0.0047
video_like_count — MSE: 382052774613.48, R^2: 0.6850

Validation Results for model Random_Forest:

video_view_count — MSE: 436456809751407.44, R^2: 0.1816
video_like_count — MSE: 429269284810.79, R^2: 0.6460

Validation Results for model XGBoost:

video_view_count — MSE: 411265869880897.25, R^2: 0.2289
video_like_count — MSE: 406664651226.69, R^2: 0.6647
In [13]:
for name, targets in results_test.items():
    print(f"\nTest Results for model {name}:\n")
    for target, metrics in targets.items():
        print(f"{target} — MSE: {metrics['MSE']:.2f}, R^2: {metrics['R^2']:.4f}")
Test Results for model Linear_Regression:

video_view_count — MSE: 205935824731273.69, R^2: 0.0030
video_like_count — MSE: 89175911251.04, R^2: 0.4316

Test Results for model Ridge:

video_view_count — MSE: 151289262543448.03, R^2: 0.2675
video_like_count — MSE: 71174926521.43, R^2: 0.5464

Test Results for model Lasso:

video_view_count — MSE: 117290294973885.66, R^2: 0.4321
video_like_count — MSE: 62341365195.37, R^2: 0.6027

Test Results for model Decision_Tree:

video_view_count — MSE: 244023322976896.06, R^2: -0.1814
video_like_count — MSE: 60328202903.22, R^2: 0.6155

Test Results for model Random_Forest:

video_view_count — MSE: 158461357677578.53, R^2: 0.2328
video_like_count — MSE: 29899911543.11, R^2: 0.8094

Test Results for model XGBoost:

video_view_count — MSE: 133287252779230.42, R^2: 0.3547
video_like_count — MSE: 28241817707.84, R^2: 0.8200
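MSE values at this scale are hard to interpret directly; taking the square root (RMSE) expresses the error in the target's own units. A minimal sketch with an illustrative round number in the rough range of the like-count MSEs above (the exact figure is an assumption for the example):

```python
import numpy as np

def rmse_from_mse(mse):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mse))

# An MSE of 2.8e10 on like counts corresponds to an RMSE of roughly 1.67e5 likes.
print(round(rmse_from_mse(2.8e10)))
```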
In [14]:
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Validation {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1
In [15]:
for metric in ["MSE", "R^2"]:
    for target in ["video_view_count", "video_like_count"]:
        values = [results_test[model][target][metric] for model in model_classes]
        fig = go.Figure(data=[
            go.Bar(x=list(model_classes.keys()), y=values)
        ])
        fig.update_layout(
            title=f"Test {metric.upper()} Comparison for {target}",
            xaxis_title="Model",
            yaxis_title=metric.upper(),
            template="plotly_white"
        )
        fig.show()

        filename = f'figure_{100+n_target}.html'
        filepath = os.path.join('iframe_figures', filename)
        fig.write_html(filepath)

        n_target += 1

Conclusions¶

Based on the validation results, the best models for predicting the number of likes (video_like_count) are the linear models, specifically Ridge and Lasso regression, which achieved high $R^2$ values around 0.92, indicating strong predictive performance. For predicting the number of views (video_view_count), although all models showed at best moderate performance, Lasso regression slightly outperformed the others with an $R^2$ of 0.5821. Therefore, Ridge and Lasso are the most reliable models for predicting likes, while Lasso is the most suitable choice for estimating views during validation.

Considering that the split into training, validation, and test data was based on the videos' publication date, and that Ridge and Lasso generalized reasonably to the test dataset, the best single model for predicting both the number of views and likes is Lasso, even though XGBoost performed better on the test dataset. If we deploy a regression model to predict views and likes per video, only the training and validation datasets inform model selection; in this notebook, the test dataset is used solely to choose between Ridge and Lasso.
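Since the training loop persists each fitted model following the f'{name}_views.pkl' pattern, the chosen Lasso model can be reloaded the same way for later predictions. A self-contained sketch with toy data and a local file name (illustrative, not the MODELS_DIR artifacts saved above):

```python
import joblib
import numpy as np
from sklearn.linear_model import Lasso

# Fit a small Lasso on y = 2x + 1, persist it, then reload it for prediction.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Lasso(alpha=0.01).fit(X, y)
joblib.dump(model, 'Lasso_views_demo.pkl')

reloaded = joblib.load('Lasso_views_demo.pkl')
pred = reloaded.predict(np.array([[4.0]]))
print(pred)  # close to 9, slightly shrunk by the L1 penalty
```

As with the scaler, reloading the already-fitted model keeps inference consistent with the training run that produced the reported metrics.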